NHANES-GCP: Leveraging the Google Cloud Platform and BigQuery ML for reproducible machine learning with data from the National Health and Nutrition Examination Survey
B. Ross Katz, Abdul Khan, James York-Winegar, Alexander J. Titus
Summary: NHANES, the National Health and Nutrition Examination Survey, is a program of studies led by the Centers for Disease Control and Prevention (CDC) designed to assess the health and nutritional status of adults and children in the United States (U.S.). NHANES data is frequently used by biostatisticians and clinical scientists to study health trends across the U.S., but every analysis requires extensive data management and cleaning before use, and this repetitive data engineering collectively costs valuable research time and decreases the reproducibility of analyses. Here, we introduce NHANES-GCP, a set of Cloud Development Kit for Terraform (CDKTF) Infrastructure-as-Code (IaC) and Data Build Tool (dbt) resources built on the Google Cloud Platform (GCP) that automates the data engineering and management aspects of working with NHANES data. With current GCP pricing, NHANES-GCP costs less than $2 to run and incurs less than $15/yr in ongoing costs for hosting the NHANES data, all while providing researchers with clean data tables that can readily be integrated for large-scale analyses. We provide examples of leveraging BigQuery ML to select data, integrate data, train machine learning and statistical models, and generate results, all from a single SQL-like query. NHANES-GCP is designed to enhance the reproducibility of analyses and create a well-engineered NHANES data resource for statistics, machine learning, and fine-tuning Large Language Models (LLMs). Availability and implementation: NHANES-GCP is available at https://github.com/In-Vivo-Group/NHANES-GCP
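As a rough illustration of the single-query workflow the abstract describes, a BigQuery ML model can be trained and evaluated directly over a hosted table. The project, dataset, table, and column names below are hypothetical placeholders, not the actual schema produced by the NHANES-GCP repository:

```sql
-- Sketch only: `my_project.nhanes.demographics_labs` and all column
-- names are illustrative, not the repository's real schema.
CREATE OR REPLACE MODEL `my_project.nhanes.diabetes_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['diabetes']
) AS
SELECT
  age, bmi, systolic_bp, ldl_cholesterol, diabetes
FROM
  `my_project.nhanes.demographics_labs`
WHERE
  diabetes IS NOT NULL;

-- A second statement returns evaluation metrics for the trained model.
SELECT * FROM ML.EVALUATE(MODEL `my_project.nhanes.diabetes_model`);
```

Because BigQuery ML handles feature ingestion, training, and evaluation server-side, the selection, integration, and modeling steps collapse into declarative SQL rather than a bespoke pipeline.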
Moving towards reproducible machine learning - Nature Computational Science
An important step when constructing a model is the collection and selection of the datasets, as the quality of the model greatly depends on the quality and characteristics of the data. The data collection process needs to be properly discussed and reported, as there can be biases (intentional and/or unintentional) with regard to the selected data sources. Any identified biases, and attempts to mitigate them, should also be properly discussed, so that other researchers can be aware of the limitations when using the reported models. If synthetic data is used, the data generation process, including any assumptions made, needs to be described in detail. Raw datasets are in fact rarely used directly, since they may contain inconsistencies, errors, and outliers that can ultimately impact the quality of the model. In addition, data might need to be converted to a specific format and representation in order to be used for a specific model.
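The cleaning and conversion steps described above can be sketched in a few lines of Python. The field name, plausibility bounds, and record layout here are illustrative assumptions, not taken from any particular study:

```python
def clean_records(records, field, lo, hi):
    """Keep records whose `field` is numeric (coercing string-coded
    numbers) and falls within a plausible range [lo, hi]; anything
    else is treated as a data-entry error and dropped."""
    cleaned = []
    for record in records:
        value = record.get(field)
        # Coerce string-coded numbers; skip non-numeric strings.
        if isinstance(value, str):
            try:
                value = float(value)
            except ValueError:
                continue
        if not isinstance(value, (int, float)):
            continue  # missing or non-numeric value
        # Drop values outside the plausible range (likely entry errors).
        if lo <= value <= hi:
            cleaned.append({**record, field: float(value)})
    return cleaned

rows = [{"bmi": 22.5}, {"bmi": "31.0"}, {"bmi": None},
        {"bmi": "n/a"}, {"bmi": 400.0}, {"bmi": 27.3}]
print(clean_records(rows, "bmi", 10, 80))
# → [{'bmi': 22.5}, {'bmi': 31.0}, {'bmi': 27.3}]
```

Documenting such thresholds and coercions alongside the model is exactly the kind of reporting the paragraph above calls for, since they silently shape the training data.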
In Search of a Common Deep Learning Stack
Web serving had the LAMP stack, and big data had its SMACK stack. But when it comes to deep learning, the technology gods have yet to give us a standard suite of tools and technologies that is universally accepted. The idea of a common "stack" upon which developers build, and administrators run, applications has become popular in recent years. Faced with a multitude of competing options, developers fear picking the "wrong" tools and technologies and being left on the dark side of a forked project. Administrators, tasked with keeping developers' creations running, are similarly afraid of inheriting a technological albatross that weighs them down.
Reproducible machine learning with PyTorch and Quilt
In this article, we'll train a PyTorch model to perform super-resolution imaging, a technique for gracefully upscaling images by inferring pixel values from a lower-resolution input. Machine learning projects typically begin by acquiring data, cleaning the data, and converting it into model-native formats. Such manual data pipelines are tedious to create and difficult to reproduce over time, across collaborators, and across machines. Moreover, trained models are often stored haphazardly, without version control.
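One lightweight remedy for the "stored haphazardly, without version control" problem, independent of any particular packaging tool, is to record a content hash for every data file and model artifact so collaborators can verify they hold byte-identical inputs. A minimal standard-library sketch, with an illustrative manifest format of my own choosing:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large artifacts don't exhaust memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(paths, manifest="manifest.json"):
    """Record a content hash per artifact; committing this manifest
    lets anyone verify their copy of the data matches exactly."""
    entries = {str(p): sha256_of(p) for p in paths}
    Path(manifest).write_text(json.dumps(entries, indent=2))
    return entries

# Example: hash a small data file created on the fly.
Path("train.csv").write_text("age,bmi\n40,22.5\n")
manifest = write_manifest(["train.csv"])
```

Tools like Quilt build on the same idea, adding storage and retrieval of the versioned artifacts rather than just their fingerprints.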